{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 2
   },
   "source": [
    "# \ud83d\udcc3 Solution for Exercise M4.04\n",
    "\n",
    "In the previous Module we tuned the hyperparameter `C` of the logistic\n",
    "regression without mentioning that it controls the regularization strength.\n",
    "Later, on the slides on \ud83c\udfa5 **Intuitions on regularized linear models** we\n",
    "mentioned that a small `C` provides a more regularized model, whereas a\n",
    "non-regularized model is obtained with an infinitely large value of `C`.\n",
    "Indeed, `C` behaves as the inverse of the `alpha` coefficient in the `Ridge`\n",
    "model.\n",
    "\n",
    "In this exercise, we ask you to train a logistic regression classifier using\n",
    "different values of the parameter `C` to find its effects by yourself.\n",
    "\n",
    "We start by loading the dataset. We only keep the Adelie and Chinstrap classes\n",
    "to keep the discussion simple."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n",
    "Appendix - Datasets description section at the end of this MOOC.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "penguins = pd.read_csv(\"../datasets/penguins_classification.csv\")\n",
    "penguins = (\n",
    "    penguins.set_index(\"Species\").loc[[\"Adelie\", \"Chinstrap\"]].reset_index()\n",
    ")\n",
    "\n",
    "culmen_columns = [\"Culmen Length (mm)\", \"Culmen Depth (mm)\"]\n",
    "target_column = \"Species\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "penguins_train, penguins_test = train_test_split(\n",
    "    penguins, random_state=0, test_size=0.4\n",
    ")\n",
    "\n",
    "data_train = penguins_train[culmen_columns]\n",
    "data_test = penguins_test[culmen_columns]\n",
    "\n",
    "target_train = penguins_train[target_column]\n",
    "target_test = penguins_test[target_column]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We define a function to help us fit a given `model` and plot its decision\n",
    "boundary. We recall that by using a `DecisionBoundaryDisplay` with diverging\n",
    "colormap, `vmin=0` and `vmax=1`, we ensure that the 0.5 probability is mapped\n",
    "to the white color. Equivalently, the darker the color, the closer the\n",
    "predicted probability is to 0 or 1 and the more confident the classifier is in\n",
    "its predictions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "from sklearn.inspection import DecisionBoundaryDisplay\n",
    "\n",
    "\n",
    "def plot_decision_boundary(model):\n",
    "    model.fit(data_train, target_train)\n",
    "    accuracy = model.score(data_test, target_test)\n",
    "    C = model.get_params()[\"logisticregression__C\"]\n",
    "\n",
    "    disp = DecisionBoundaryDisplay.from_estimator(\n",
    "        model,\n",
    "        data_train,\n",
    "        response_method=\"predict_proba\",\n",
    "        plot_method=\"pcolormesh\",\n",
    "        cmap=\"RdBu_r\",\n",
    "        alpha=0.8,\n",
    "        vmin=0.0,\n",
    "        vmax=1.0,\n",
    "    )\n",
    "    DecisionBoundaryDisplay.from_estimator(\n",
    "        model,\n",
    "        data_train,\n",
    "        response_method=\"predict_proba\",\n",
    "        plot_method=\"contour\",\n",
    "        linestyles=\"--\",\n",
    "        linewidths=1,\n",
    "        alpha=0.8,\n",
    "        levels=[0.5],\n",
    "        ax=disp.ax_,\n",
    "    )\n",
    "    sns.scatterplot(\n",
    "        data=penguins_train,\n",
    "        x=culmen_columns[0],\n",
    "        y=culmen_columns[1],\n",
    "        hue=target_column,\n",
    "        palette=[\"tab:blue\", \"tab:red\"],\n",
    "        ax=disp.ax_,\n",
    "    )\n",
    "    plt.legend(bbox_to_anchor=(1.05, 0.8), loc=\"upper left\")\n",
    "    plt.title(f\"C: {C} \\n Accuracy on the test set: {accuracy:.2f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's now create our predictive model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "logistic_regression = make_pipeline(StandardScaler(), LogisticRegression())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Influence of the parameter `C` on the decision boundary\n",
    "\n",
    "Given the following candidates for the `C` parameter and the\n",
    "`plot_decision_boundary` function, find out the impact of `C` on the\n",
    "classifier's decision boundary.\n",
    "\n",
    "- How does the value of `C` impact the confidence on the predictions?\n",
    "- How does it impact the underfit/overfit trade-off?\n",
    "- How does it impact the position and orientation of the decision boundary?\n",
    "\n",
    "Try to give an interpretation on the reason for such behavior."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Cs = [1e-6, 0.01, 0.1, 1, 10, 100, 1e6]\n",
    "\n",
    "# solution\n",
    "for C in Cs:\n",
    "    logistic_regression.set_params(logisticregression__C=C)\n",
    "    plot_decision_boundary(logistic_regression)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "source": [
    "\n",
    "On this series of plots we can observe several important points. Regarding the\n",
    "confidence on the predictions:\n",
    "\n",
    "- For low values of `C` (strong regularization), the classifier is less\n",
    "  confident in its predictions. We are enforcing a **spread sigmoid**.\n",
    "- For high values of `C` (weak regularization), the classifier is more\n",
    "  confident: the areas with dark blue (very confident in predicting \"Adelie\")\n",
    "  and dark red (very confident in predicting \"Chinstrap\") nearly cover the\n",
    "  entire feature space. We are enforcing a **steep sigmoid**.\n",
    "\n",
    "To answer the next question, think that misclassified data points are more\n",
    "costly when the classifier is more confident on the decision. Decision rules\n",
    "are mostly driven by avoiding such cost. From the previous observations we can\n",
    "then deduce that:\n",
    "\n",
    "- The smaller the `C` (the stronger the regularization), the lower the cost\n",
    "  of a misclassification. As more data points lay in the low-confidence\n",
    "  zone, the more the decision rules are influenced almost uniformly by all\n",
    "  the data points. This leads to a less expressive model, which may underfit.\n",
    "- The higher the value of `C` (the weaker the regularization), the more the\n",
    "  decision is influenced by a few training points very close to the boundary,\n",
    "  where decisions are costly. Remember that models may overfit if the number\n",
    "  of samples in the training set is too small, as at least a minimum of\n",
    "  samples is needed to average the noise out.\n",
    "\n",
    "The orientation is the result of two factors: minimizing the number of\n",
    "misclassified training points with high confidence and their distance to the\n",
    "decision boundary (notice how the contour line tries to align with the most\n",
    "misclassified data points in the dark-colored zone). This is closely related\n",
    "to the value of the weights of the model, which is explained in the next part\n",
    "of the exercise.\n",
    "\n",
    "Finally, for small values of `C` the position of the decision boundary is\n",
    "affected by the class imbalance: when `C` is near zero, the model predicts the\n",
    "majority class (as seen in the training set) everywhere in the feature space.\n",
    "In our case, there are approximately two times more \"Adelie\" than \"Chinstrap\"\n",
    "penguins. This explains why the decision boundary is shifted to the right when\n",
    "`C` gets smaller. Indeed, the most regularized model predicts light blue\n",
    "almost everywhere in the feature space."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Impact of the regularization on the weights\n",
    "\n",
    "Look at the impact of the `C` hyperparameter on the magnitude of the weights.\n",
    "**Hint**: You can [access pipeline\n",
    "steps](https://scikit-learn.org/stable/modules/compose.html#access-pipeline-steps)\n",
    "by name or position. Then you can query the attributes of that step such as\n",
    "`coef_`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "lr_weights = []\n",
    "for C in Cs:\n",
    "    logistic_regression.set_params(logisticregression__C=C)\n",
    "    logistic_regression.fit(data_train, target_train)\n",
    "    coefs = logistic_regression[-1].coef_[0]\n",
    "    lr_weights.append(pd.Series(coefs, index=culmen_columns))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "outputs": [],
   "source": [
    "lr_weights = pd.concat(lr_weights, axis=1, keys=[f\"C: {C}\" for C in Cs])\n",
    "lr_weights.plot.barh()\n",
    "_ = plt.title(\"LogisticRegression weights depending of C\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "source": [
    "\n",
    "As small `C` provides a more regularized model, it shrinks the weights values\n",
    "toward zero, as in the `Ridge` model.\n",
    "\n",
    "In particular, with a strong penalty (e.g. `C = 0.01`), the weight of the feature\n",
    "named \"Culmen Depth (mm)\" is almost zero. It explains why the decision\n",
    "separation in the plot is almost perpendicular to the \"Culmen Length (mm)\"\n",
    "feature.\n",
    "\n",
    "For even stronger penalty strengths (e.g. `C = 1e-6`), the weights of both\n",
    "features are almost zero. It explains why the decision separation in the plot\n",
    "is almost constant in the feature space: the predicted probability is only\n",
    "based on the intercept parameter of the model (which is never regularized)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Impact of the regularization on with non-linear feature engineering\n",
    "\n",
    "Use the `plot_decision_boundary` function to repeat the experiment using a\n",
    "non-linear feature engineering pipeline. For such purpose, insert\n",
    "`Nystroem(kernel=\"rbf\", gamma=1, n_components=100)` between the\n",
    "`StandardScaler` and the `LogisticRegression` steps.\n",
    "\n",
    "- Does the value of `C` still impact the position of the decision boundary and\n",
    "  the confidence of the model?\n",
    "- What can you say about the impact of `C` on the underfitting vs overfitting\n",
    "  trade-off?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.kernel_approximation import Nystroem\n",
    "\n",
    "# solution\n",
    "classifier = make_pipeline(\n",
    "    StandardScaler(),\n",
    "    Nystroem(kernel=\"rbf\", gamma=1.0, n_components=100, random_state=0),\n",
    "    LogisticRegression(max_iter=1000),\n",
    ")\n",
    "\n",
    "for C in Cs:\n",
    "    classifier.set_params(logisticregression__C=C)\n",
    "    plot_decision_boundary(classifier)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "source": [
    "\n",
    "- For the lowest values of `C`, the overall pipeline underfits: it predicts\n",
    "  the majority class everywhere, as previously.\n",
    "- When `C` increases, the models starts to predict some datapoints from the\n",
    "  \"Chinstrap\" class but the model is not very confident anywhere in the\n",
    "  feature space.\n",
    "- The decision boundary is no longer a straight line: the linear model is now\n",
    "  classifying in the 100-dimensional feature space created by the `Nystroem`\n",
    "  transformer. As are result, the decision boundary induced by the overall\n",
    "  pipeline is now expressive enough to wrap around the minority class.\n",
    "- For `C = 1` in particular, it finds a smooth red blob around most of the\n",
    "  \"Chinstrap\" data points. When moving away from the data points, the model is\n",
    "  less confident in its predictions and again tends to predict the majority\n",
    "  class according to the proportion in the training set.\n",
    "- For higher values of `C`, the model starts to overfit: it is very confident\n",
    "  in its predictions almost everywhere, but it should not be trusted: the\n",
    "  model also makes a larger number of mistakes on the test set (not shown in\n",
    "  the plot) while adopting a very curvy decision boundary to attempt fitting\n",
    "  all the training points, including the noisy ones at the frontier between\n",
    "  the two classes. This makes the decision boundary very sensitive to the\n",
    "  sampling of the training set and as a result, it does not generalize well in\n",
    "  that region. This is confirmed by the (slightly) lower accuracy on the test\n",
    "  set.\n",
    "\n",
    "Finally, we can also note that the linear model on the raw features was as\n",
    "good or better than the best model using non-linear feature engineering. So in\n",
    "this case, we did not really need this extra complexity in our pipeline.\n",
    "**Simpler is better!**\n",
    "\n",
    "So to conclude, when using non-linear feature engineering, it is often\n",
    "possible to make the pipeline overfit, even if the original feature space is\n",
    "low-dimensional. As a result, it is important to tune the regularization\n",
    "parameter in conjunction with the parameters of the transformers (e.g. tuning\n",
    "`gamma` would be important here). This has a direct impact on the certainty of\n",
    "the predictions."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}